Uni-Hema: Unified Model for Digital Hematopathology

Rehman, Abdul, Rasool, Iqra, Imran, Ayisha, Ali, Mohsen, Sultani, Waqas

arXiv.org Artificial Intelligence

Digital hematopathology requires cell-level analysis across diverse disease categories, including malignant disorders (e.g., leukemia), infectious conditions (e.g., malaria), and non-malignant red blood cell disorders (e.g., sickle cell disease). Existing approaches, whether single-task, vision-language, WSI-optimized, or single-cell hematology models, share a key limitation: they cannot provide unified, multi-task, multi-modal reasoning across the complexities of digital hematopathology. To overcome these limitations, we propose Uni-Hema, a multi-task, unified model for digital hematopathology integrating detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases. Uni-Hema leverages 46 publicly available datasets, encompassing over 700K images and 21K question-answer pairs, and is built upon Hema-Former, a multimodal module that bridges visual and linguistic representations hierarchically across tasks (detection, classification, segmentation, morphology, masked language modeling, and visual question answering) at different granularities. Extensive experiments demonstrate that Uni-Hema achieves performance comparable or superior to models trained on a single task and a single dataset, across diverse hematological tasks, while providing interpretable, morphologically relevant insights at the single-cell level. Our framework establishes a new standard for multi-task and multi-modal digital hematopathology. The code will be made publicly available.



AV-Dialog: Spoken Dialogue Models with Audio-Visual Input

Chen, Tuochao, Veluri, Bandhav, Gong, Hongyu, Gollakota, Shyamnath

arXiv.org Artificial Intelligence

Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialogue framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection, and accurate responses, resulting in a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.




A Unified Model for Human Mobility Generation in Natural Disasters

Long, Qingyue, Wang, Huandong, Wang, Qi Ryan, Li, Yong

arXiv.org Artificial Intelligence

Human mobility generation in disaster scenarios plays a vital role in resource allocation, emergency response, and rescue coordination. During disasters such as wildfires and hurricanes, human mobility patterns often deviate from their normal states, which makes the task more challenging. However, existing works usually rely on limited data from a single city or specific disaster, significantly restricting the model's generalization capability in new scenarios. In fact, disasters are highly sudden and unpredictable, and any city may encounter new types of disasters without prior experience. Therefore, we aim to develop a one-for-all model for mobility generation that can generalize to new disaster scenarios. However, building a universal framework faces two key challenges: 1) the diversity of disaster types and 2) the heterogeneity among different cities. In this work, we propose a unified model for human mobility generation in natural disasters (named UniDisMob). To enable cross-disaster generalization, we design physics-informed prompts and physics-guided alignment that leverage the underlying common patterns in mobility changes after different disasters to guide the generation process. To achieve cross-city generalization, we introduce a meta-learning framework that extracts universal patterns across multiple cities through shared parameters and captures city-specific features via private parameters. Extensive experiments across multiple cities and disaster scenarios demonstrate that our method significantly outperforms state-of-the-art baselines, achieving an average performance improvement exceeding 13%.
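The shared/private parameter split described above can be sketched in a few lines. This is a minimal illustration, not UniDisMob's actual architecture: the matrix sizes, the `generate` function, and the city names are all hypothetical, and the real model presumably uses learned neural components rather than fixed random weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared parameters: a single weight matrix meta-learned across all cities,
# standing in for the backbone that captures universal mobility patterns.
shared_W = rng.normal(size=(8, 8))

# Private parameters: one small city-specific matrix per city, standing in
# for the heads that capture city-specific features (density, road layout).
private_W = {city: rng.normal(size=(8, 8)) for city in ["city_a", "city_b"]}

def generate(features, city):
    """Forward pass: shared backbone followed by the city's private head."""
    h = np.tanh(features @ shared_W)   # universal representation
    return h @ private_W[city]         # city-specific adaptation

x = rng.normal(size=(4, 8))            # a toy batch of mobility features
out_a = generate(x, "city_a")
out_b = generate(x, "city_b")
```

The same input produces different outputs per city because only the private head differs, which is the mechanism that lets the shared parameters transfer to a new city while local heterogeneity stays isolated.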


From Masks to Worlds: A Hitchhiker's Guide to World Models

Bai, Jinbin, Lei, Yu, Wu, Hecong, Zhu, Yuchen, Li, Shufan, Xin, Yi, Li, Xiangtai, Tao, Molei, Grover, Aditya, Yang, Ming-Hsuan

arXiv.org Artificial Intelligence

This is not a typical survey of world models; it is a guide for those who want to build worlds. We do not aim to catalog every paper that has ever mentioned a "world model". Instead, we follow one clear road: from early masked models that unified representation learning across modalities, to unified architectures that share a single paradigm, then to interactive generative models that close the action-perception loop, and finally to memory-augmented systems that sustain consistent worlds over time. We bypass loosely related branches to focus on the core: the generative heart, the interactive loop, and the memory system. We show that this is the most promising path towards true world models. The term world model has been used to describe many different ideas: learned environment simulators for reinforcement learning (Ha & Schmidhuber, 2018; Hafner et al., 2019), agents that integrate learned models with planning (Schrittwieser et al., 2020), and large language models that simulate entire societies (Park et al., 2023). Yet despite hundreds of related works, there is no clear consensus on how to actually build a true world model. In this paper, we take a stance: the path is much narrower than it appears. A true world model is not a monolithic entity but a system synthesized from three core subsystems: a generative heart that produces world states, an interactive loop that closes the action-perception cycle in real time, and a persistent memory system that sustains coherence over long horizons. The history of the field can be understood as an evolutionary journey from first mastering these components in isolation to now integrating them. Most works focus on optimizing narrow tasks and drift away from the generative, interactive, and persistent nature required for a true world model. To make this perspective concrete, we chart the historical evolution of world models as a sequence of five stages, shown in Figure 1.
It begins with Stage I: Mask-based Models, which established a universal, token-based pretraining paradigm across modalities.


UniRL-Zero: Reinforcement Learning on Unified Models with Joint Language Model and Diffusion Model Experts

Wang, Fu-Yun, Zhang, Han, Gharbi, Michael, Li, Hongsheng, Park, Taesung

arXiv.org Artificial Intelligence

We present UniRL-Zero, a unified reinforcement learning (RL) framework that boosts multimodal language model understanding and reasoning, diffusion model multimedia generation, and their beneficial interaction within a unified model. Our work defines six scenarios for unified-model reinforcement learning, providing systematic baselines for RL on unified understanding and generation models. Our code is available at https://github.com/G-UN/UniRL.


The Telephone Game: Evaluating Semantic Drift in Unified Models

Mollah, Sabbir, Gupta, Rohit, Swetha, Sirnam, Liu, Qingyang, Munir, Ahnaf, Shah, Mubarak

arXiv.org Artificial Intelligence

At every step we observe semantic drift. For example, in the 5th generation, the model fails to generate a convincing suitcase, which also hints at cross-inconsistency. These phenomena are magnified under the multi-generation telephone game evaluation, allowing it to capture more subtle performance differences between models. Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T. Existing evaluation benchmarks consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME and MMBench for I2T. These isolated single-pass metrics do not reveal cross-consistency: whether a model that "understands" a concept can also "render" it, or whether semantic meaning is preserved when cycling between image and text modalities. To address this, we introduce the Semantic Drift Protocol (SDP) for Unified Models, a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. We propose two metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic drift; and (ii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond the COCO dataset, which is widely used in training, we create a new benchmark, Nocaps+Docci400, sampled from NoCaps and DOCCI, and evaluate seven recent models. SDP reveals substantial variation in cross-modal stability: some models like BAGEL maintain semantic meaning over many alternations, whereas others like VILA-U drift quickly despite strong single-pass scores. Our results highlight SDP as a necessary complement to standard I2T and T2I evaluations.
Multimodal Unified Models (UMs) combine visual understanding and generation within a single framework, enabling a wide range of unimodal tasks (e.g., text-to-text, image-to-image) as well as cross-modal tasks (e.g., image-to-text, text-to-image). Despite rapid model progress, UM evaluation remains fragmented: current single-pass metrics do not assess the retention of entities, attributes, relations, and counts under alternating I2T and T2I conversions. We defer unimodal tasks and center our analysis on I2T and T2I, where the potential for semantic divergence and its impact on real use is most pronounced.
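The cyclic protocol behind SDP can be sketched as a simple loop: start from a seed caption, alternate T2I and I2T, and after each generation measure how far the current caption's embedding has drifted from the seed's. Everything below is an illustrative assumption, not the paper's implementation: the toy `embed` function stands in for a real frozen encoder, the `t2i`/`i2t` lambdas stand in for a unified model's two directions, and the exact MCD formula (here, mean of one minus cosine similarity) is a guess at the general shape of an embedding-based drift score.

```python
import numpy as np

def embed(item):
    """Toy deterministic-in-run embedding; a real setup would use a frozen
    encoder (e.g., a CLIP-like model) shared across all generations."""
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.normal(size=16)
    return v / np.linalg.norm(v)

def mean_cumulative_drift(seed_prompt, i2t, t2i, generations=5):
    """Alternate T2I and I2T for several generations, accumulating the
    drop in cosine similarity between each caption and the seed prompt."""
    ref = embed(seed_prompt)
    text = seed_prompt
    drifts = []
    for _ in range(generations):
        image = t2i(text)                      # text -> image
        text = i2t(image)                      # image -> caption
        drifts.append(1.0 - float(ref @ embed(text)))
    return float(np.mean(drifts))

# Toy stand-ins for a unified model's two directions: each cycle slightly
# perturbs the caption, so drift accumulates across generations.
t2i = lambda t: "img:" + t
i2t = lambda im: im.removeprefix("img:") + "."

mcd = mean_cumulative_drift("a red suitcase on a bench", i2t, t2i)
```

With unit-norm embeddings the per-generation drift lies in [0, 2]; a model that preserves meaning perfectly would score near 0, and the averaged score rewards stability across all generations rather than only the final one.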